jgauthier3710@floridapoly.edu
This report explores the relationship between several demographic and economic attributes and how walkable regions of Florida are on average. The data come from the National Walkability Index dataset on the EPA’s website (Environmental Protection Agency, 2021). The dataset was filtered so that only districts in the state of Florida remained, and these districts were grouped by their CBSA (core-based statistical area). Next, summary statistics were computed for each region, so that the new dataset includes the average National Walkability Index score (AvgNWI), the average percentage of the population that is working-age (AvgP_Wrk), the average district population (AvgDisPop), the total population across districts (TotCPop), and the average percentage of low-wage workers (AvgP_LowW) for all the districts in each CBSA region. Each of these attributes captures a different aspect of walkability, demographics, or economic characteristics within the selected regions. I predict that areas with larger average working-age and smaller average low-wage population percentages will have higher walkability scores on average.
My original plan was to make three plots: an interactive scatter plot, a choropleth map, and a heatmap. In the end, I made these three plots plus a coefficients plot. My first plot is an interactive scatter plot of AvgP_Wrk vs AvgP_LowW with points color-coded by AvgNWI. The second figure is a choropleth map that shows the AvgNWI across different CBSA regions in Florida. Next, I made a coefficients plot from a multiple linear regression model predicting AvgNWI, because I wanted to see how the other variables relate to AvgNWI. Lastly, I made a heatmap showing the correlations between AvgNWI and the other selected attributes.
library(tidyverse)
library(sf)
library(leaflet)
library(terra)
library(htmlwidgets)
library(plotly)
library(broom)
walkability <- read_csv("../data/EPA_SmartLocationDatabase_V3_Jan_2021_Final.csv")
Rows: 220740 Columns: 117
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): CSA_Name, CBSA_Name
dbl (115): OBJECTID, GEOID10, GEOID20, STATEFP, COUNTYFP, TRACTCE, BLKGRPCE, CSA, CBSA, CBSA_POP, CBSA_EMP, CBSA_WRK, Ac_Total, Ac...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdb_path <- "../data/Natl_WI.gdb"
layers <- st_layers(gdb_path)
layer_name <- layers$name[1]
nwi <- st_read(gdb_path, layer = layer_name)
Reading layer `NationalWalkabilityIndex' from data source
`C:\Users\Jackie\Downloads\dataviz_mini-project_02\dataviz_mini-project_02\dataviz_mini-project_02\data\Natl_WI.gdb'
using driver `OpenFileGDB'
Simple feature collection with 220739 features and 29 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -10434580 ymin: -83867.97 xmax: 3407868 ymax: 6755033
Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic_USGS_version
flwalkability <- walkability %>%
  filter(STATEFP == '12') %>%
  group_by(CBSA_Name) %>%
  summarize(
    AvgNWI = mean(NatWalkInd, na.rm = TRUE),
    AvgP_Wrk = mean(P_WrkAge, na.rm = TRUE),
    AvgDisPop = mean(TotPop, na.rm = TRUE),
    TotCPop = sum(TotPop, na.rm = TRUE),
    AvgP_LowW = mean(R_PCTLOWWAGE, na.rm = TRUE)
  )
flwalkability
# Join the summarized data to the map for Florida
florida_nwi <- nwi[nwi$STATEFP == '12', ]
florida_nwi <- florida_nwi[!is.na(florida_nwi$NatWalkInd), ]
users_map <- florida_nwi %>%
  left_join(flwalkability, by = "CBSA_Name")
users_map
Simple feature collection with 11442 features and 34 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 796752.4 ymin: 259071.7 xmax: 1612207 ymax: 961154.4
Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic_USGS_version
First 10 features:
GEOID10 GEOID20 STATEFP COUNTYFP TRACTCE BLKGRPCE CSA CSA_Name CBSA
1 121170221051 121170221051 12 117 022105 1 422 Orlando-Lakeland-Deltona, FL 36740
2 120710104093 120710104093 12 071 010409 3 163 Cape Coral-Fort Myers-Naples, FL 15980
3 120710104104 120710104104 12 071 010410 4 163 Cape Coral-Fort Myers-Naples, FL 15980
4 120710104103 120710104103 12 071 010410 3 163 Cape Coral-Fort Myers-Naples, FL 15980
5 120860003071 120860003071 12 086 000307 1 370 Miami-Port St. Lucie-Fort Lauderdale, FL 33100
6 120950147031 120950147031 12 095 014703 1 422 Orlando-Lakeland-Deltona, FL 36740
7 120330036144 120330036144 12 033 003614 4 426 Pensacola-Ferry Pass, FL-AL 37860
8 120339900000 120339900000 12 033 990000 0 426 Pensacola-Ferry Pass, FL-AL 37860
9 120710104041 120710104041 12 071 010404 1 163 Cape Coral-Fort Myers-Naples, FL 15980
10 120710104042 120710104042 12 071 010404 2 163 Cape Coral-Fort Myers-Naples, FL 15980
CBSA_Name Ac_Total Ac_Water Ac_Land Ac_Unpr TotPop CountHU HH Workers D2B_E8MIXA
1 Orlando-Kissimmee-Sanford, FL 377.49665 0.00000 377.49665 377.49665 1571 747 643 1012 0.6101164
2 Cape Coral-Fort Myers, FL 691.21604 25.29563 665.92042 665.92042 1880 877 688 876 0.5984436
3 Cape Coral-Fort Myers, FL 625.08763 53.62688 571.46076 571.46076 1875 807 807 901 0.5047176
4 Cape Coral-Fort Myers, FL 1610.68231 191.16595 1419.51636 1419.51636 4061 1971 1582 1632 0.5229428
5 Miami-Fort Lauderdale-Pompano Beach, FL 99.85187 0.00000 99.85187 99.85187 1658 384 364 619 0.2788802
6 Orlando-Kissimmee-Sanford, FL 423.49975 18.13615 405.36360 405.36360 2491 1155 982 1395 0.5964563
7 Pensacola-Ferry Pass-Brent, FL 3826.38748 15.06668 3811.32081 3807.37991 2032 688 524 563 0.7117882
8 Pensacola-Ferry Pass-Brent, FL 75522.48436 75522.48436 0.00000 0.00000 0 0 0 0 0.0000000
9 Cape Coral-Fort Myers, FL 816.05813 70.99751 745.06061 745.06061 2809 1151 905 1224 0.6974998
10 Cape Coral-Fort Myers, FL 804.52992 56.96297 747.56695 747.56695 2297 1073 938 1297 0.5069232
D2A_EPHHM D3B D4A D2A_Ranked D2B_Ranked D3B_Ranked D4A_Ranked NatWalkInd Shape_Length Shape_Area AvgNWI AvgP_Wrk
1 0.3411523 91.587780 -99999.00 6 12 13 1 7.666667 5858.584 1527709 10.190048 0.6075743
2 0.4112783 40.060523 -99999.00 8 12 8 1 6.333333 6683.597 2797313 9.420233 0.5153093
3 0.5293845 69.835907 -99999.00 11 8 11 1 7.166667 6436.426 2529690 9.420233 0.5153093
4 0.2998326 46.905440 807.35 5 9 9 14 10.000000 10926.322 6518339 9.420233 0.5153093
5 0.1725376 126.068339 355.40 2 3 16 17 11.833333 2543.113 404094 12.707212 0.5918418
6 0.5454343 97.397991 584.73 12 11 14 15 13.500000 5224.096 1713882 10.190048 0.6075743
7 0.7462556 6.047835 -99999.00 17 16 4 1 7.166667 18751.923 15485177 9.039653 0.5996245
8 0.0000000 0.000000 -99999.00 1 1 1 1 1.000000 128518.435 305636842 9.039653 0.5996245
9 0.5202777 53.853551 -99999.00 11 15 10 1 8.000000 8180.276 3302543 9.420233 0.5153093
10 0.2965802 77.940310 -99999.00 5 8 12 1 6.500000 7837.987 3255888 9.420233 0.5153093
AvgDisPop TotCPop AvgP_LowW Shape
1 2937.963 2450261 0.2433612 MULTIPOLYGON (((1433116 731...
2 1398.208 718679 0.2486237 MULTIPOLYGON (((1398559 499...
3 1398.208 718679 0.2486237 MULTIPOLYGON (((1398823 497...
4 1398.208 718679 0.2486237 MULTIPOLYGON (((1396073 493...
5 1775.130 6070944 0.2214451 MULTIPOLYGON (((1589877 448...
6 2937.963 2450261 0.2433612 MULTIPOLYGON (((1419643 714...
7 1791.688 481964 0.2577461 MULTIPOLYGON (((825795.4 87...
8 1791.688 481964 0.2577461 MULTIPOLYGON (((813914 8334...
9 1398.208 718679 0.2486237 MULTIPOLYGON (((1399432 501...
10 1398.208 718679 0.2486237 MULTIPOLYGON (((1400114 499...
# Create the base ggplot
my_plot <- ggplot(
  data = flwalkability,
  mapping = aes(x = AvgP_Wrk, y = AvgP_LowW, color = AvgNWI)) +
  geom_point(aes(text = paste(
    "CBSA Name: ", CBSA_Name, "<br>",
    "Average District Population: ", AvgDisPop
  )), size = 4) +
  scale_color_viridis_c() +
  labs(
    title = "Average Portion of the Population that is Working Age vs Low Wage",
    x = "Average Portion of the Population that is Working Age",
    y = "Average Portion of Workers that are Low Wage",
    color = "AvgNWI"
  ) +
  theme_minimal()
Warning in geom_point(aes(text = paste("CBSA Name: ", CBSA_Name, "<br>", :
Ignoring unknown aesthetics: text
# Convert the ggplot to an interactive plotly plot
interactive_plot <- ggplotly(my_plot, tooltip = "text")
interactive_plot
saveWidget(interactive_plot, "interactive_plot.html")
The original plan for the plot shown above was to make an interactive plot that shows the relationship between the average working-age population (AvgP_Wrk) and average low-wage population (AvgP_LowW) across different CBSA regions in Florida, with points color-coded by the average National Walkability Index (AvgNWI). I wanted the CBSA name and the average district population to appear when I hover my mouse over a point. To make this plot interactive, plotly was used. This was the easiest plot to create, and I did not encounter any difficulties creating it. To explore this data further, I could add a trend line to highlight patterns in the scatter plot. I could also show more information when I hover over each point.
This plot allows us to explore how characteristics like working-age population and low-wage employment relate to each other across different areas. Additionally, this plot tells the story of how these characteristics relate to National Walkability Index scores. From the graph, it appears that low-wage employment is negatively correlated with walkability, and the working-age population is positively correlated with walkability. This information can be used to inform policies related to employment and urban planning. One way I applied data science principles to this plot was by keeping the color scale for the AvgNWI variable the same as in the second plot. Additionally, the graph is kept minimal and the points are sized so that they are easy to interact with. The hover labels on each point are easy to read and understand.
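The trend line mentioned above as a possible extension could be sketched as follows. This is only a minimal sketch, not the code used for the figure: the `toy` data frame and its values are invented stand-ins for flwalkability, and the choice of a straight-line (`lm`) fit is an assumption.

```r
library(ggplot2)

# Hypothetical stand-in for flwalkability; these values are invented for illustration
toy <- data.frame(
  CBSA_Name = c("Region A", "Region B", "Region C", "Region D", "Region E"),
  AvgP_Wrk  = c(0.52, 0.55, 0.58, 0.60, 0.61),
  AvgP_LowW = c(0.26, 0.25, 0.25, 0.24, 0.22),
  AvgNWI    = c(8.9, 9.4, 9.8, 10.2, 12.7)
)

# Color is mapped inside geom_point() only, so geom_smooth() fits one
# line through all points and is drawn in a single fixed color
trend_plot <- ggplot(toy, aes(x = AvgP_Wrk, y = AvgP_LowW)) +
  geom_point(aes(color = AvgNWI), size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "gray40") +
  scale_color_viridis_c() +
  theme_minimal()

trend_plot
```

Keeping the color mapping out of the global `aes()` is a design choice that keeps the smoother independent of the walkability color scale.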
ggplot(data = users_map) +
  geom_sf(aes(fill = AvgNWI), size = 0.1, color = "gray80") +
  scale_fill_viridis_c(option = "viridis", name = "Walkability Index") +
  labs(title = "Average National Walkability Index in CBSA Regions of Florida",
       caption = "Gray areas on the map represent areas without walkability data") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
    axis.text = element_blank(),
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.caption = element_text(hjust = 1)
  ) +
  guides(
    fill = guide_colorbar(
      title.position = "top",
      title.hjust = 0.5,
      title.vjust = 1,
      title.theme = element_text(size = 8)
    ))
The original chart planned for this figure was a choropleth map displaying the average National Walkability Index (AvgNWI) across CBSA regions in Florida. For this chart, the walkability data was merged with the geographic data for the state of Florida. To make this graph, I tested a variety of colors to see which color would suit the missing data the best. I decided that light gray would be the best color because it was the least distracting. Figuring out the best color scheme was the main difficulty that I encountered when making this graph. Additionally, I initially had some difficulty merging the data to create this graph. One thing I could add to the graph is a label for the most walkable area.
This map tells the story of how walkable different regions of Florida are on average. This plot can be used by policymakers to identify regions with high walkability scores in Florida. They could then look into the urban planning policies in areas with a high walkability score. The graph applies several data visualization principles, including a consistent color scheme and gradient to represent the AvgNWI values effectively. Additionally, the design is kept minimal so that the focus stays on the data shown in the map.
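One way to diagnose the kind of merge difficulty mentioned above is to check which map rows fail to find a matching CBSA summary before plotting. The following is a minimal sketch under stated assumptions: `toy_map` and `toy_summary` are hypothetical stand-ins for florida_nwi (without geometry) and flwalkability, and their values are invented.

```r
library(dplyr)

# Hypothetical stand-in for florida_nwi without the geometry column
toy_map <- data.frame(
  GEOID10   = c("A1", "A2", "B1", "C1"),
  CBSA_Name = c("Cape Coral-Fort Myers, FL", "Cape Coral-Fort Myers, FL",
                "Pensacola-Ferry Pass-Brent, FL", "Orlando-Kissimmee-Sanford, FL")
)

# Hypothetical stand-in for flwalkability; one region is deliberately missing
toy_summary <- data.frame(
  CBSA_Name = c("Cape Coral-Fort Myers, FL", "Pensacola-Ferry Pass-Brent, FL"),
  AvgNWI    = c(9.42, 9.04)
)

# Map rows whose CBSA_Name has no match in the summary table
unmatched <- anti_join(toy_map, toy_summary, by = "CBSA_Name")

# The left join keeps every map row; unmatched rows get NA for AvgNWI
joined <- left_join(toy_map, toy_summary, by = "CBSA_Name")

unmatched
sum(is.na(joined$AvgNWI))
```

Rows returned by `anti_join()` are the ones that would have no fill value on a choropleth, so inspecting them first can show whether a join key is misspelled or absent.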
# Load necessary libraries
library(broom)
# Fit multiple linear regression model
model <- lm(AvgNWI ~ AvgP_Wrk + AvgDisPop + TotCPop + AvgP_LowW, data = flwalkability)
# Use broom::tidy to extract coefficients and their confidence intervals
coefficients <- tidy(model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") # Remove intercept from plotting
# Plot coefficients with confidence intervals
ggplot(coefficients, aes(x = estimate, y = fct_rev(term))) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, color = "violet") +
  labs(
    title = "Coefficients of Multiple Linear Regression Model",
    x = "Coefficient",
    y = ""
  ) +
  theme_minimal()
This plot displays the coefficients with confidence intervals from a multiple linear regression model predicting AvgNWI using AvgP_Wrk, AvgDisPop, TotCPop, and AvgP_LowW. To make this plot, a multiple linear regression model named model was fit with AvgNWI as the dependent variable and the other variables as predictors. Then the coefficients and their confidence intervals were extracted. The main difficulty that I encountered was deciding which model-based plot I wanted to create. One additional piece of information that I could add to the plot is the p-values. I was motivated to create this plot because a multiple linear regression model is a useful way to understand how each predictor variable is associated with the AvgNWI variable. This approach not only quantifies these associations but can also help predict AvgNWI for new areas or scenarios.
This plot tells the story of how strongly each of the predictor variables is associated with the average National Walkability Index value (AvgNWI). By displaying the coefficients, the plot provides insights into which factors are most strongly related to the walkability of an area. The plot shows that AvgP_Wrk and AvgP_LowW have the largest coefficients, while the coefficients for the other two variables are close to zero. Because AvgDisPop and TotCPop are measured in people rather than as proportions, their near-zero coefficients partly reflect the scale of those variables and do not by themselves show that population has no effect. One data visualization principle that was applied in this graph is minimalism. Additionally, I used color coding to draw the viewer’s attention to the zero coefficient line.
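Regression coefficients depend on the units of each predictor, so one way to compare them on a common footing is to standardize the predictors before fitting. The sketch below illustrates the idea on simulated data. It is not the model used in this report: the `toy` data, the simulated effect sizes, and the reduced set of two predictors are all assumptions made for illustration.

```r
set.seed(42)  # reproducible toy data

# Hypothetical stand-in for flwalkability: 30 invented regions with one
# proportion-scale predictor and one population-scale predictor
n <- 30
toy <- data.frame(
  AvgP_Wrk = runif(n, 0.50, 0.65),
  TotCPop  = runif(n, 1e5, 6e6)
)
toy$AvgNWI <- 2 + 10 * toy$AvgP_Wrk + 5e-7 * toy$TotCPop + rnorm(n, sd = 0.3)

# Model on the raw scale: the TotCPop coefficient is per additional person
raw_model <- lm(AvgNWI ~ AvgP_Wrk + TotCPop, data = toy)

# Same model after rescaling each predictor to mean 0 and standard deviation 1
predictors <- c("AvgP_Wrk", "TotCPop")
toy_std <- toy
toy_std[predictors] <- lapply(toy[predictors], function(x) as.numeric(scale(x)))
std_model <- lm(AvgNWI ~ AvgP_Wrk + TotCPop, data = toy_std)

# Side-by-side comparison of the slopes (intercept dropped)
comparison <- data.frame(
  term         = predictors,
  raw          = unname(coef(raw_model)[predictors]),
  standardized = unname(coef(std_model)[predictors])
)
comparison
```

In this simulation the raw TotCPop coefficient is tiny because it is expressed per person, while the standardized coefficient is expressed per standard deviation of population and is therefore directly comparable with the other predictor.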
The original figure planned here was a heatmap displaying the correlation matrix among AvgNWI, AvgP_Wrk, AvgDisPop, TotCPop, and AvgP_LowW. For this heatmap, the correlation matrix was first computed, converted into long format, and rounded for plotting with ggplot. This plot visualizes the strength and direction of correlations among the variables and helps us identify potential relationships and dependencies. The main difficulty that I encountered with this plot was deciding whether or not to include it. In the end, I decided to keep it because it adds to the previous graphs.
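The steps described above (compute the correlations, round them, reshape to long format, plot) could be sketched as follows. The code for this figure is not shown in the report, so this is only an illustrative sketch: the `toy` data frame is randomly generated, and the blue-white-red diverging palette is an assumption.

```r
library(ggplot2)

set.seed(1)  # reproducible toy data

# Hypothetical stand-in for the numeric columns of flwalkability
toy <- data.frame(
  AvgNWI    = rnorm(20, mean = 10, sd = 1.5),
  AvgP_Wrk  = rnorm(20, mean = 0.58, sd = 0.03),
  AvgDisPop = rnorm(20, mean = 1800, sd = 400),
  TotCPop   = rnorm(20, mean = 1e6, sd = 3e5),
  AvgP_LowW = rnorm(20, mean = 0.24, sd = 0.02)
)

# 1. Pairwise Pearson correlations, rounded for display
cor_matrix <- round(cor(toy, use = "pairwise.complete.obs"), 2)

# 2. Long format: one row per pair of variables
cor_long <- as.data.frame(as.table(cor_matrix))
names(cor_long) <- c("Var1", "Var2", "Correlation")

# 3. Heatmap with a diverging palette centered on zero
heatmap_plot <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Correlation), size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = NULL, y = NULL) +
  theme_minimal()

heatmap_plot
```

Fixing the fill limits at -1 and 1 keeps the palette symmetric, so positive and negative correlations of equal strength get equally intense colors.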
This plot tells the story of how correlated the variables analyzed in this report are with each other. The data visualization principles applied in this graph include using a color scheme that distinguishes positive from negative values. Additionally, I used a minimal theme and only labeled the parts that needed labels.
The findings and visualizations generally confirm my assumptions about the relationships between walkability and the predictor variables. Higher walkability tends to correlate with higher working-age population percentages and slightly lower proportions of low-wage workers. This suggests that areas with better walkability might attract a more economically active and potentially higher-earning population. Higher-walkability areas also tend to have larger populations.
U.S. Environmental Protection Agency. (2021). National Walkability Index. Data.gov. Retrieved June 15, 2024, from https://catalog.data.gov/dataset/walkability-index1/
U.S. Environmental Protection Agency. (2021). Smart location mapping. Retrieved June 15, 2024, from https://www.epa.gov/smartgrowth/smart-location-mapping
Healy, K. (2019). Data visualization: A practical introduction. Retrieved from https://socviz.co/refineplots.html#refineplots